Preparations

Load the necessary libraries

library(mgcv)      #for GAMs
library(gratia)    #for GAM plots
library(broom)     #for tidy output#
library(emmeans)   #for marginal means etc
library(MuMIn)     #for model selection and AICc
library(tidyverse) #for data wrangling
library(DHARMa)    #for simulated residuals
library(performance) #for residual disagnostics
library(see)        # to visualize residual diagnostics
theme_set(theme_classic())

Scenario

In a chapter on time series analysis, Reed et al. (2007) presented Hawaiian longitudinal waterbird survey data. These data comprise winter counts of various species of stilts, coots and moorehen along with year and the previous seasons rainfall. Here, we will explore the temporal patterns in the Kauai Moorhen.

Moorhen

Format of reed.csv data file

year stilt_oahu stilt_maui coot_oahu coot_maui moorhen rainfa ll
1956 163 169 528 177 2 15.16
1957 272 190 338 273 NA 15.48
1958 549 159 449 256 2 16.26
1959 533 211 822 170 10 21.25
1960 NA 232 NA 188 4 10.94
1961 134 155 717 149 10 1 9.93
year - a continuous predictor
stilt_oahu - the abundance of the Oahu stilt
stilt_maui - the abundance of the Maui stilt
coot_oahu - the abundance of the Oahu coot
coot_maui - the abundance of the Maui coot
moorhen - the abundance of the Kauai moorhen
rainfall - the number of centimeters (or inches) of rain

Read in the data

reed_full <- read_csv('../data/reed.csv', trim_ws=TRUE) %>% 
  janitor::clean_names() %>%
  rename(moorhen = moorhen_kauai)
## Parsed with column specification:
## cols(
##   Year = col_double(),
##   Stilt.Oahu = col_double(),
##   Stilt.Maui = col_double(),
##   Coot.Oahu = col_double(),
##   Coot.Maui = col_double(),
##   Moorhen.Kauai = col_double(),
##   Rainfall = col_double()
## )
reed <- filter(reed_full, complete.cases(moorhen))
glimpse(reed)
## Rows: 45
## Columns: 7
## $ year       <dbl> 1956, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1966, 1967…
## $ stilt_oahu <dbl> 163, 549, 533, NA, 134, 175, 356, 485, 242, 209, 175, 162,…
## $ stilt_maui <dbl> 169, 159, 211, 232, 155, 282, 170, 164, 253, 188, 226, 171…
## $ coot_oahu  <dbl> 528, 449, 822, NA, 717, 12, 169, 98, 77, 106, 64, 15, 130,…
## $ coot_maui  <dbl> 177, 256, 170, 188, 149, 205, 108, 79, 75, 80, 104, 122, 7…
## $ moorhen    <dbl> 2, 2, 10, 4, 10, 12, 10, 8, 17, 7, 44, 50, 26, 10, 7, 1, 3…
## $ rainfall   <dbl> 15.16, 16.26, 21.25, 10.94, 19.93, 12.63, 20.09, 10.02, 12…

Exploratory data analysis

ggplot(reed, aes(moorhen, x = year)) +
  geom_point() +
  geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cr"), 
              method.args = list(family = 'poisson'))

We can clearly see that it is a non-linear trend that is increasing through time, which would be difficult to model using any polynomials or worse, linear terms alone. Also note that the values are all 0+ integers, so a poisson or negative binomial family of distirbution would work well for these data!

ggplot(reed, aes(moorhen, x = rainfall)) +
  geom_point() +
  geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cr"), 
              method.args = list(family = 'poisson'))

Model formula: \[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ log(\lambda_i) =\beta_0 + f(year_i) + f(rainfall_i) \]

where \(\beta_0\) is the y-intercept. \(f(year)\) and \(f(rainfall)\) indicate the additive smoothing functions of year and rainfall respectively.

Fit the model and validate

We don’t need the thin-plate to be efficient because we don’t have a ton of data (as in, thousands of data points), thus we can use cubic regression.

reed_gam1 <- gam(moorhen ~ s(year, bs = "cr") + s(rainfall, bs = "cr"), 
    data = reed, family = poisson(link = "log"), method = "REML", select = TRUE)
k.check(reed_gam1)
##             k'      edf   k-index p-value
## s(year)      9 8.581571 0.7744578  0.0350
## s(rainfall)  9 6.394206 1.3160101  0.9775

The k-check is saying that the rainfall is ok (edf below max k, i.e. k’), but the # of knots for the year smoother is potentially not enough (k-index less than 1!!), so we can double it to make sure that it is the same.

appraise(reed_gam1)

Clearly a big problem with the QQ plot, suggesting overdispersion! This will show up in the DHARMa residuals. Other plots look ok overall.

s <- simulateResiduals(reed_gam1, plot=T)

testDispersion(s) # very overdispersed

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 6.0372, p-value < 2.2e-16
## alternative hypothesis: two.sided
reed_gam2 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1) + s(rainfall, bs = "cr"), 
    data = reed, family = poisson(link = "log"), method = "REML", select = TRUE)
k.check(reed_gam2)
##             k'          edf  k-index p-value
## s(year)     18 1.727006e+01 1.440891   0.995
## s(rainfall)  9 4.521326e-04 1.121432   0.745

k-index is not less than 1 now, so looks good.

DHARMa residuals still likely to look like shite:

appraise(reed_gam2)

s <- simulateResiduals(reed_gam2, plot=T)

Still looks bad…

testZeroInflation(s) 

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 0, p-value = 1
## alternative hypothesis: two.sided

No evidence of zero-inflation.

testDispersion(s)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.7367, p-value = 0.072
## alternative hypothesis: two.sided

Overdispersed, but technically not significantly so. Still, the QQ plot is terrible, so should probably address!!

Refit and validate models

reed_gam3 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1) + s(rainfall, bs = "cr"), 
    data = reed, family = nb(link = "log"), method = "REML", select = TRUE)
k.check(reed_gam3)
##             k'       edf   k-index p-value
## s(year)     18 7.8195821 0.6980527  0.0175
## s(rainfall)  9 0.3391737 1.1980039  0.9350

Problem with this fit is that DHARMa doesn’t work with this fit!

simulateResiduals(reed_gam3)

So to use it, we need to manually extract theta, the dispersion parameter.

(theta <- reed_gam3$family$getTheta(TRUE))
## [1] 4.380278

There is more than 4x the amount of expected dispersion! Next, we can re-fit the model with theta, allowing DHARMa to work with our model.

reed_gam3 <- reed_gam3 %>% update(.~., family = negbin(link = "log", theta = theta))
k.check(reed_gam3)
##             k'       edf   k-index p-value
## s(year)     18 7.8195757 0.6980527  0.0325
## s(rainfall)  9 0.3392386 1.1980041  0.9350

K-index is low, but notice that k’ and edf are distingly different, with a lower edf, thus we do NOT need to add knots! Remember, it is a combination of all three things that determines if we need to add knots or not.

s <- simulateResiduals(reed_gam3, plot=T)

testDispersion(s)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.41874, p-value = 0.064
## alternative hypothesis: two.sided
appraise(reed_gam3)

concurvity(reed_gam3)
##                 para   s(year) s(rainfall)
## worst    2.85743e-30 0.7865911   0.7865911
## observed 2.85743e-30 0.2683207   0.4674018
## estimate 2.85743e-30 0.2267308   0.4248958

All looks good! QQ plot looks a bit weird, but not too too bad.

Partial plots

draw(reed_gam3, resid = T)

# # Using base R graphics:
# plot(reed_gam3, 
#      pages=1, shift=coef(reed_gam3)[1],
#      trans=exp,
#      resid=TRUE, cex=4, scale=0)

Model investigation / hypothesis testing

summary(reed_gam3)
## 
## Family: Negative Binomial(4.38) 
## Link function: log 
## 
## Formula:
## moorhen ~ s(year, bs = "cr", k = 9 * 2 + 1) + s(rainfall, bs = "cr")
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.81023    0.07793   48.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df  Chi.sq p-value    
## s(year)     7.8196     18 176.829  <2e-16 ***
## s(rainfall) 0.3392      9   0.416   0.247    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.813   Deviance explained = 81.7%
## -REML = 217.41  Scale est. = 1         n = 45

Evidence of a wiggly effect of year, no such evidence for rainfall, thus we may consider shifting rainfall to a linear term alone.

reed_gam4 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1) + rainfall, 
    data = reed, family = nb(link = "log"), method = "REML", select = TRUE)
reed_gam4 <- reed_gam4 %>% update(.~., family = negbin(link = "log", theta = theta))
summary(reed_gam4)
## 
## Family: Negative Binomial(4.38) 
## Link function: log 
## 
## Formula:
## moorhen ~ s(year, bs = "cr", k = 9 * 2 + 1) + rainfall
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.7986400  0.2183542  17.397   <2e-16 ***
## rainfall    0.0006351  0.0111731   0.057    0.955    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df Chi.sq p-value    
## s(year) 7.912     18    178  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.808   Deviance explained = 81.7%
## -REML = 220.99  Scale est. = 1         n = 45
reed_gam5 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1), 
    data = reed, family = nb(link = "log"), method = "REML", select = TRUE)
reed_gam5 <- reed_gam5 %>% update(.~., family = negbin(link = "log", theta = theta))
AICc(reed_gam3, reed_gam4, reed_gam5) %>% arrange(AICc)

So definitely no support for rainfall having much of any effect!

Conclusions:

  • The average number of Moorhen is 3.81 on the link scale or 45.16 on the response scale. This number corresponds to the average number of Moorhens expected for the average year with the average rainfall.
  • There is evidence of a significantly wiggly change in Moorhen numbers over time (see s(year) term significance).
  • There is no evidence of a wiggly rainfall trend (see s(rainfall))
  • We might consider dropping the smoother for rainfall in preference for a regular linear parametric term, or even delete rainfall altogether, since the edf is already very close to zero.
tidy(reed_gam3)

Summary figures

newdata <- with(reed, list(year = modelr::seq_range(year, n=100),
                                 rainfall = mean(rainfall, na.rm=T))) %>%
  emmeans(reed_gam3, ~year, at = ., type = "response") %>%
  as.data.frame %>%
  mutate(moorhen = response, lwr = lower.CL, upr = upper.CL)

p <- ggplot(newdata, aes(y = moorhen, x = year)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.3) +
  geom_line()
p + geom_point(data = reed)

These data are not strictly correct!, as they are after standardizing for rainfall (another variable in our model). Thus, when we have multiple values, we need to get the marginal plots.

We need to add back the residuals to the predicted to get the ‘observed’ values, after accounting for the effect of rainfall.

reed_obs <- with(reed_gam3$model, 
                 data.frame(year = year, rainfall = mean(rainfall))) %>%
  mutate(pred = predict(reed_gam3, newdata = ., type = "link"),
         resid = reed_gam3$residuals,
         moorhen = exp(pred + resid))

p + geom_point(data = reed_obs) +
  geom_point(data = reed, col = "red")

Note the difference! Note that this is not the same as model shrinkage, even though the plot looks very similar to that in Statistical Rethinking.

References

Reed, J. M., Elphick C. S., Zuur A. F., and Ieno E. N. 2007. “Analysing Ecological Data.” In, edited by Zuur A. F., Ieno E. N., and Smith G. M., 615–32. Springer, New York.